R?RstudioR analyses in polished R markdown filesR resourcesRR installed
RR StudioR studioR studio environment by locating the following features:
R studio by clicking the top left icon- open a new R script.RMarkdownRMarkdownR chunks into Rmarkdown documentsR Markdown FilesRRcode chunks in RMarkdown# symbols[]R follows the normal priority of mathematical evaluation (PEDMAS)RInput code chunk and then output
4 * 4## [1] 16
Input code chunk and then output
(4 + 3 * 2^2)## [1] 16
<- operator.R is case sensitive.x <- 2
x * 3## [1] 6
y <- x * 3
y - 2## [1] 4
These do not work
3y <- 3
3*y <- 3x <- 12
x + 2## [1] 14
x^2## [1] 144
log(x)## [1] 2.484907
log - is a built in function of R, and therefore the object of the function needs to be put in parenthesesy <- 67
print(y)## [1] 67
x <- 124
z <- (x * y)^2
print(z)## [1] 69022864
c stands for concatenatex <- "I Love"
print(x)## [1] "I Love"
y <- "Biostatistics"
print(y)## [1] "Biostatistics"
z <- c(x, y)
print(z)## [1] "I Love" "Biostatistics"
R thinks in terms of vectors
R user to try to write scripts with that in mindn <- c(2, 3, 4, 2, 1, 2, 4, 5, 10, 8, 9)
print(n)## [1] 2 3 4 2 1 2 4 5 10 8 9
z <- n + 3
print(z)## [1] 5 6 7 5 4 5 7 8 13 11 12
x is now what is called a list of character values.factors, and we can redefine our character variables as factors.n_factor <- as.factor(n)
print(n_factor)
z_factor <- as.factor(z)
print(z_factor)str(n)
str(n_factor)
class(n)
class(n_factor)R “sees” a variable using str() or class() functions.int stands for integers
dbl stands for doubles, or real numbers
chr stands for character vectors, or strings
dttm stands for date-times (a date + a time)
lgl stands for logical, vectors that contain only TRUE or FALSE
fctr stands for factors, which R uses to represent categorical variables with fixed possible values
date stands for dates
FALSETRUENA which is ‘not available’.R numbers are doubles by default.NANaN which is ‘not a number’Inf-InfMany functions exist to operate on vectors.
mean(n)
median(n)
var(n)
log(n)
exp(n)
sqrt(n)
sum(n)
length(n)
sample(n, replace = T)sample) has an argument (replace=T)R and it is easy enough to write your own functions if none already exist to do what you want to do.-help(mean)
-`?`(mean)
-example(mean)
-help.search("mean")
-apropos("mean")
-args(mean)seqsampleseq_1 <- seq(0, 10, by = 0.1)
print(seq_1)## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3
## [15] 1.4 1.5 1.6 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7
## [29] 2.8 2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0 4.1
## [43] 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9 5.0 5.1 5.2 5.3 5.4 5.5
## [57] 5.6 5.7 5.8 5.9 6.0 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9
## [71] 7.0 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8.0 8.1 8.2 8.3
## [85] 8.4 8.5 8.6 8.7 8.8 8.9 9.0 9.1 9.2 9.3 9.4 9.5 9.6 9.7
## [99] 9.8 9.9 10.0
seq_2 <- seq(10, 0, by = -0.1)
print(seq_2)## [1] 10.0 9.9 9.8 9.7 9.6 9.5 9.4 9.3 9.2 9.1 9.0 8.9 8.8 8.7
## [15] 8.6 8.5 8.4 8.3 8.2 8.1 8.0 7.9 7.8 7.7 7.6 7.5 7.4 7.3
## [29] 7.2 7.1 7.0 6.9 6.8 6.7 6.6 6.5 6.4 6.3 6.2 6.1 6.0 5.9
## [43] 5.8 5.7 5.6 5.5 5.4 5.3 5.2 5.1 5.0 4.9 4.8 4.7 4.6 4.5
## [57] 4.4 4.3 4.2 4.1 4.0 3.9 3.8 3.7 3.6 3.5 3.4 3.3 3.2 3.1
## [71] 3.0 2.9 2.8 2.7 2.6 2.5 2.4 2.3 2.2 2.1 2.0 1.9 1.8 1.7
## [85] 1.6 1.5 1.4 1.3 1.2 1.1 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3
## [99] 0.2 0.1 0.0
seq_square <- (seq_2) * (seq_2)
print(seq_square)## [1] 100.00 98.01 96.04 94.09 92.16 90.25 88.36 86.49 84.64 82.81
## [11] 81.00 79.21 77.44 75.69 73.96 72.25 70.56 68.89 67.24 65.61
## [21] 64.00 62.41 60.84 59.29 57.76 56.25 54.76 53.29 51.84 50.41
## [31] 49.00 47.61 46.24 44.89 43.56 42.25 40.96 39.69 38.44 37.21
## [41] 36.00 34.81 33.64 32.49 31.36 30.25 29.16 28.09 27.04 26.01
## [51] 25.00 24.01 23.04 22.09 21.16 20.25 19.36 18.49 17.64 16.81
## [61] 16.00 15.21 14.44 13.69 12.96 12.25 11.56 10.89 10.24 9.61
## [71] 9.00 8.41 7.84 7.29 6.76 6.25 5.76 5.29 4.84 4.41
## [81] 4.00 3.61 3.24 2.89 2.56 2.25 1.96 1.69 1.44 1.21
## [91] 1.00 0.81 0.64 0.49 0.36 0.25 0.16 0.09 0.04 0.01
## [101] 0.00
seq_square_new <- (seq_2)^2
print(seq_square_new)## [1] 100.00 98.01 96.04 94.09 92.16 90.25 88.36 86.49 84.64 82.81
## [11] 81.00 79.21 77.44 75.69 73.96 72.25 70.56 68.89 67.24 65.61
## [21] 64.00 62.41 60.84 59.29 57.76 56.25 54.76 53.29 51.84 50.41
## [31] 49.00 47.61 46.24 44.89 43.56 42.25 40.96 39.69 38.44 37.21
## [41] 36.00 34.81 33.64 32.49 31.36 30.25 29.16 28.09 27.04 26.01
## [51] 25.00 24.01 23.04 22.09 21.16 20.25 19.36 18.49 17.64 16.81
## [61] 16.00 15.21 14.44 13.69 12.96 12.25 11.56 10.89 10.24 9.61
## [71] 9.00 8.41 7.84 7.29 6.76 6.25 5.76 5.29 4.84 4.41
## [81] 4.00 3.61 3.24 2.89 2.56 2.25 1.96 1.69 1.44 1.21
## [91] 1.00 0.81 0.64 0.49 0.36 0.25 0.16 0.09 0.04 0.01
## [101] 0.00
x <- rnorm(n = 10000, mean = 0, sd = 10)
y <- sample(1:10000, 10000, replace = T)
xy <- cbind(x, y)
plot(xy)x <- rnorm(10000, 0, 10)
y <- sample(1:10000, 10000, replace = T)
xy <- cbind(x, y)
hist(x)dnorm() generates the probability density, which can be plotted using the curve() function.add=TRUEx <- rnorm(1000, 0, 100)
hist(x, xlim = c(-500, 500))
curve(50000 * dnorm(x, 0, 100), xlim = c(-500, 500), add = TRUE,
col = "Red")Rhist function.plot function (as well as a number of more sophisticated plotting functions).high level plotting function, which sets the stageLow level plotting functions will tweak the plots and make them beautifulseq_1 <- seq(0, 10, by = 0.1)
plot(seq_1, xlab = "space", ylab = "function of space", type = "p",
col = "red")seq_1 <- seq(0, 10, by = 0.1)
seq_2 <- seq(10, 0, by = -0.1)par(mfrow = c(2, 2))
plot(seq_1, xlab = "time", ylab = "p in population 1", type = "p",
col = "red")
plot(seq_2, xlab = "time", ylab = "p in population 2", type = "p",
col = "green")
plot(seq_square, xlab = "time", ylab = "p2 in population 2",
type = "p", col = "blue")
plot(seq_square_new, xlab = "time", ylab = "p in population 1",
type = "l", col = "yellow")Complete Excercises 1.3-1.8
RRR you can generate your own random data set drawn from nearly any distribution very easily.habitat <- factor(c("mixed", "wet", "wet", "wet", "dry", "dry",
"dry", "mixed"))
temp <- c(3.4, 3.4, 8.4, 3, 5.6, 8.1, 8.3, 4.5)
elevation <- c(0, 9.2, 3.8, 5, 5.6, 4.1, 7.1, 5.3)mydata <- data.frame(habitat, temp, elevation)
row.names(mydata) <- c("Reedy Lake", "Pearcadale", "Warneet",
"Cranbourne", "Lysterfield", "Red Hill", "Devilbend", "Olinda")
head(mydata)## habitat temp elevation
## Reedy Lake mixed 3.4 0.0
## Pearcadale wet 3.4 9.2
## Warneet wet 8.4 3.8
## Cranbourne wet 3.0 5.0
## Lysterfield dry 5.6 5.6
## Red Hill dry 8.1 4.1
R is being able to import data from an external source
R.R look in the CWD, whereas a full path can also be usedYourFile <- read.table("yourfile.csv", header = T, row.names = 1,
sep = ",")
YourFile <- read.csv("yourfile.csv", header = T, row.names = 1,
sep = ",")
YourFile <- read.table("yourfile.txt", header = T, row.names = 1,
sep = "\t")write.csv(YourFile, "yourfile.csv", quote = F, row.names = T,
sep = ",")
write.table(YourFile, "yourfile.txt", quote = F, row.names = T,
sep = "\t")print(YourFile[, 2])
print(YourFile$temp)
print(YourFile[2, ])
plot(YourFile$temp, YourFile$elevation)x <- (YourFile[, 2])
y <- (YourFile$temp)
z <- (YourFile$elevation)
plot(y, z)tapply(YourFile$temp, YourFile$habitat, mean)
tapply(YourFile$temp, YourFile$habitat, var)Complete Exercises 1.9-1.10
git clone https://github.com/wcresko/evomics_stat_2019.gitgit status
git merge origin/masterknit button to render markdown*Italic* or _Italic_
**Bold** or __Bold__> "You know the greatest danger facing us is ourselves, an irrational fear of the unknown.
But there’s no such thing as the unknown — only things temporarily hidden, temporarily not understood."
>
> --- Captain James T. Kirk“You know the greatest danger facing us is ourselves, an irrational fear of the unknown. But there’s no such thing as the unknown — only things temporarily hidden, temporarily not understood.”
— Captain James T. Kirk
-list_element
-sub_list_element
-sub_list_element
-sub_list_element
-list_element
-sub_list_element1. One
2. Two
3. Three
4. Four$$ \large a^x, \sqrt[n]{x}, \vec{\jmath}, \tilde{\imath}$$\[ \large a^x, \sqrt[n]{x}, \vec{\jmath}, \tilde{\imath}\]
$$ \large \alpha, \beta, \gamma$$\[ \large \alpha, \beta, \gamma\]
$$ \large\approx, \neq, \nsim $$\[ \large\approx, \neq, \nsim \]
$$\large \partial, \mathbb{R}, \flat$$\[\large \partial, \mathbb{R}, \flat\]
Binomial sampling equation
$$\large f(k) = {n \choose k} p^{k} (1-p)^{n-k}$$\[\large f(k) = {n \choose k} p^{k} (1-p)^{n-k}\]
Poisson Sampling Equation
$$\large Pr(Y=r) = \frac{e^{-\mu}\mu^r}{r!}$$\[\large Pr(Y=r) = \frac{e^{-\mu}\mu^r}{r!}\]
$$\iint xy^2\,dx\,dy =\frac{1}{6}x^2y^3$$\[\iint xy^2\,dx\,dy =\frac{1}{6}x^2y^3\]
$$ \begin{matrix}
-2 & 1 & 0 & 0 & \cdots & 0 \\
1 & -2 & 1 & 0 & \cdots & 0 \\
0 & 1 & -2 & 1 & \cdots & 0 \\
0 & 0 & 1 & -2 & \ddots & \vdots \\
\vdots & \vdots & \vdots & \ddots & \ddots & 1 \\
0 & 0 & 0 & \cdots & 1 & -2
\end{matrix} $$\[ \begin{matrix} -2 & 1 & 0 & 0 & \cdots & 0 \\ 1 & -2 & 1 & 0 & \cdots & 0 \\ 0 & 1 & -2 & 1 & \cdots & 0 \\ 0 & 0 & 1 & -2 & \ddots & \vdots \\ \vdots & \vdots & \vdots & \ddots & \ddots & 1 \\ 0 & 0 & 0 & \cdots & 1 & -2 \end{matrix} \]
This equation, $y=\frac{1}{2}$, is included inlineThis equation, \(y=\frac{1}{2}\), is included inline
Whereas this equation, $$y=\frac{1}{2}$$, is put on a separate lineWhereas this equation \[y=\frac{1}{2}\] is put on a separate line
Say you perform an experiment on two different strains of stickleback fish, one from an ocean population (RS) and one from a freshwater lake (BP) by making them microbe free. Microbes in the gut are known to interact with the gut epithelium in ways that lead to a proper maturation of the immune system.
You carry out an experiment by treating multiple fish from each strain so that some of them have a conventional microbiota, and some are inoculated with only one bacterial species. You then measure the levels of gene expression in the stickleback gut using RNA-seq. You suspect that the sex of the fish might be important so you track it too.
Hadley Wickham and others have written R packages to modify data
These packages do many of the same things as base functions in R
However, they are specifically designed to do them faster and more easily
Wickham also wrote the package GGPlot2 for elegant graphics creations
GG stands for ‘Grammar of Graphics’
dplyr for vectorsfilter().arrange().select().mutate().summarise().filter(), arrange() & select()filter(flights, month == 11 | month == 12)arrange(flights, year, month, day)select(flights, year, month, day)mutate() & transmutate()This function will add a new variable that is a function of other variable(s)
mutate(flights_sml, gain = arr_delay - dep_delay, hours = air_time/60,
gain_per_hour = gain/hours)This function will replace the old variable with the new variable
transmute(flights, gain = arr_delay - dep_delay, hours = air_time/60,
gain_per_hour = gain/hours)group_by( ) & summarize( )This first function allows you to aggregate data by values of categorical variables (factors)
by_day <- group_by(flights, year, month, day)Once you have done this aggregation, you can then calculate values (in this case the mean) of other variables split by the new aggregated levels of the categorical variable
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))group_by( ) & summarize( )xxx
“Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space”
— Edward Tufte
Draw graphical elements clearly, minimizing clutter
Represent magnitudes honestly and accurately
“Graphical excellence begins with telling the truth about the data” – Tufte 1983
geom_bar functionlibrary(ggplot2)
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))geom_bar functionNow try this…
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, colour = cut))geom_bar functionand this…
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = cut))geom_bar functionand finally this…
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity),
position = "dodge")geom_histogram and geom_freqpoly functionWith this function you can make a histogram
ggplot(data = diamonds) + geom_histogram(mapping = aes(x = carat),
binwidth = 0.5)geom_histogram and geom_freqpoly functionThis allows you to make a frequency polygram
ggplot(data = diamonds) + geom_freqpoly(mapping = aes(x = carat),
binwidth = 0.5)geom_boxplot functionBoxplots are very useful for visualizing data
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_boxplot()geom_boxplot functionggplot(data = mpg, mapping = aes(x = class, y = hwy)) + geom_boxplot() +
coord_flip()geom_boxplot functionggplot(data = mpg, mapping = aes(x = reorder(class, hwy, FUN = median),
y = hwy)) + geom_boxplot() + coord_flip()geom_point & geom_smooth functionsggplot(data = diamonds, mapping = aes(x = x, y = y)) + geom_point()geom_point & geom_smooth functionsggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~class, nrow = 2)geom_point & geom_smooth functionsggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ cyl)geom_point & geom_smooth functionsggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy),
method = "loess")ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy), method = "loess")ggplot(data = mpg, aes(displ, hwy)) + geom_point(aes(color = class)) +
geom_smooth(se = FALSE, method = "loess") + labs(title = "Fuel efficiency generally decreases with engine size",
subtitle = "Two seaters (sports cars) are an exception because of their light weight",
caption = "Data from fueleconomy.gov")